Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

printf: improve support of printing multi-byte values of characters #7048

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

sylvestre
Copy link
Contributor

No description provided.

@sylvestre sylvestre force-pushed the printf-go branch 7 times, most recently from 95a1333 to 80988c3 Compare January 1, 2025 17:46
Copy link

github-actions bot commented Jan 1, 2025

GNU testsuite comparison:

Congrats! The gnu test tests/printf/printf-mb is no longer failing!

Copy link

github-actions bot commented Jan 1, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

@sylvestre sylvestre force-pushed the printf-go branch 3 times, most recently from 0e9e153 to 63eb50f Compare January 1, 2025 18:35
Copy link

github-actions bot commented Jan 1, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

@sylvestre
Copy link
Contributor Author

@jtracey given your work on #7020 i am wondering if you would be able to help with the windows support ? :) thanks

#[cfg(windows)]
let format_vec: Vec<u8> = format
.encode_wide()
.flat_map(|wchar| wchar.to_le_bytes())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't do what we want, because it's effectively casting UTF-16 into a byte array, and UTF-16 is not byte-compatible with UTF-8 (e.g., ASCII in UTF-8 is just the literal ASCII byte per char, while ASCII in UTF-16 will be two bytes each). Because invalid unicode is much rarer in Windows, the way to turn OsStr(ing)s into bytes for us is using os_str_as_bytes, which will error on invalid unicode on Windows, or os_str_as_bytes_lossy, which will turn invalid unicode sequences into the replacement character on Windows. Windows makes it difficult to pass invalid unicode as an argument, so whichever feels more appropriate should be fine here.

Also, a nit that won't be relevant after the change, but cfgs assuming unix or windows is a code smell, since there are platforms that are neither (none that we support yet, but I hope to change that at some point 😛).

.collect();
FormatArgument::Unparsed(
String::from_utf8(raw_bytes.clone())
.unwrap_or_else(|_| raw_bytes.iter().map(|&b| b as char).collect()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find it documented anywhere, but collecting b as char has unpredictable behavior. Casting a u8 to char casts to the char with the code point with that value, not the UTF-8 parsing of that value. This was made invisible because somewhere, Rust seems to be doing some kind of dynamic dispatch to instead collect the bytes into a UTF-8 string if the whole iterator is valid UTF-8. Here's a playground if you want a more detailed demonstration.

Ideally, the FormatArgument::Unparsed value should contain bytes or an OsString instead of a String, but assuming that's outside the scope of this PR, it's better to specify what should happen if a non-UTF-8 string is passed in, using to_string or to_string_lossy, avoiding the encoding of values as bytes entirely for now (this also avoids allocation overhead, since char is the size of a u32).

Copy link

github-actions bot commented Jan 6, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants